Automatically identifying gene/protein terms in MEDLINE abstracts

Authors

  • Hong Yu
  • Vasileios Hatzivassiloglou
  • Andrey Rzhetsky
  • W. John Wilbur
Abstract

MOTIVATION Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, identifying terms that correspond to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the markup process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore, our methods may also be used for automatic lexicon generation.

RESULTS GPmarkup achieves 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts.

AVAILABILITY A random sample of gene/protein symbols and full names and a sample set of marked-up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/.

CONTACT [email protected]. Voice: 212-939-7028; fax: 212-666-0140.
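The recall and precision figures in the abstract are standard set-overlap measures. As a minimal sketch of how such figures are typically computed (this is not the authors' evaluation code, and the function name `precision_recall` and the term sets below are hypothetical illustrations, not data from the paper):

```python
def precision_recall(gold, predicted):
    """Return (precision, recall) for a set of predicted terms against a gold set."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # terms that were correctly marked up
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data for illustration only (not results from the paper).
gold_terms = {"LARD", "lymphocyte associated receptor of death", "p53", "BRCA1"}
predicted_terms = {"LARD", "p53", "MEDLINE"}

precision, recall = precision_recall(gold_terms, predicted_terms)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the fraction of marked-up terms that are correct, and recall is the fraction of true gene/protein terms that were found, so a system can trade one against the other by marking up more or fewer candidates.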

Similar resources

Automatically identifying gene/protein terms in MEDLINE abstracts

Motivation. Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, the identification of terms corresponding to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein–protein interactions...

Automatic extraction of gene and protein synonyms from MEDLINE and journal articles

Genes and proteins are often associated with multiple names, and more names are added as new functional or structural information is discovered. Because authors often alternate between these synonyms, information retrieval and extraction benefits from identifying these synonymous names. We have developed a method to extract automatically synonymous gene and protein names from MEDLINE and journa...

Comparative experiments on learning information extractors for proteins and their interactions

OBJECTIVE Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for...

Tagging gene and protein names in biomedical text

MOTIVATION The MEDLINE database of biomedical abstracts contains scientific knowledge about thousands of interacting genes and proteins. Automated text processing can aid in the comprehension and synthesis of this valuable information. The fundamental task of identifying gene and protein names is a necessary first step towards making full use of the information encoded in biomedical text. This ...

Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles

Biomedical abbreviations and acronyms are widely used in biomedical literature. Since many of them represent important content in biomedical literature, information retrieval and extraction benefits from identifying the meanings of those terms. On the other hand, many abbreviations and acronyms are ambiguous, so it would be important to map them to their full forms, which ultimately represent the ...

Journal:
  • Journal of Biomedical Informatics

Volume 35, Issue 5-6

Pages: -

Publication date: 2002